October 2018

Getting Started

This document provides a very brief overview to basic progamming in R adapted from training provided by Prof. Ross Ihaka.

Further information can be found at https://www.stat.auckland.ac.nz/~ihaka/?Teaching

Topics include:

  • Setting Up
  • R 101: The Basics
  • Creating Plots
  • Writing Functions
  • R and Hilltop
  • Report Writing

Set up

To start we need to download R, as of Oct 2018 the latest version is 3.5.1 (Feather Spray). I also highly recommend downloading RStudio - an integrated development environment (IDE) for R.

R | RStudio

Starting to Code

Basic calculations

One way to think of R is as a very powerful calculator. If we type a calculation into the console an answer is printed out.

# Performing calculations using mathematical notation
2 + 2
## [1] 4
# Precedence can be set using parentheses
(10 + 5) / 2
## [1] 7.5

Once R has completed the computation the answer is discarded. To ‘save’ a result you can assign a name using <- or = to create a variable. Variables can then be used in subsequent expressions.

# Create a variable by assigning a name 'a'
a <- 5^2

Notice no result is printed

# Print the result named 'a'
a
## [1] 25
# Calculate the square root of 'a'
sqrt(a)
## [1] 5

Vectors

R operates on named data structures - the most basic of which is a vector. Vectors contain an indexed collection of values of the same type.

Vectors are usually created with the function c()

x <- c(1, 3, 5, 7, 9)
y <- c(TRUE, FALSE, TRUE)

Check the type of a variable

class(x)
## [1] "numeric"
class(y)
## [1] "logical"

Determine the number of elements in a vector

length(x)
## [1] 5
length(y)
## [1] 3

# Be careful of mixing types
c(1, 5.5, 7)
## [1] 1.0 5.5 7.0
c(1, TRUE, "Hello")
## [1] "1"     "TRUE"  "Hello"

Number sequences

Creating vectors by a sequence

5:10
## [1]  5  6  7  8  9 10
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
-12.5:12
##  [1] -12.5 -11.5 -10.5  -9.5  -8.5  -7.5  -6.5  -5.5  -4.5  -3.5  -2.5
## [12]  -1.5  -0.5   0.5   1.5   2.5   3.5   4.5   5.5   6.5   7.5   8.5
## [23]   9.5  10.5  11.5

Vector arithmetic

x
## [1] 1 3 5 7 9
x + 10
## [1] 11 13 15 17 19
sqrt(x)
## [1] 1.000000 1.732051 2.236068 2.645751 3.000000
1:10/2
##  [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Subsetting vectors

Values can be subset out a vector using square brackets

x
## [1] 1 3 5 7 9
x[3]
## [1] 5
x[2:4]
## [1] 3 5 7
x[-3]
## [1] 1 3 7 9

Conditional subsetting

x > 5
## [1] FALSE FALSE FALSE  TRUE  TRUE
x[x > 5]
## [1] 7 9

Named vectors

names(x) <- c("Jan", "Mar", "May", "Jul", "Sep")
x
## Jan Mar May Jul Sep 
##   1   3   5   7   9
x["May"]
## May 
##   5

Summarizing data

c(mean = mean(x), sd = sd(x))
##     mean       sd 
## 5.000000 3.162278
quantile(x)
##   0%  25%  50%  75% 100% 
##    1    3    5    7    9
cumsum(x)
## Jan Mar May Jul Sep 
##   1   4   9  16  25

Tip: To get information about a function try ?function name

Dataframes

Dataframes are one of the ways R presents tabular data.

A dataframe is comprised of columns and rows, both of which can be named, of the same length. Each column holds the same type of data.

rain <- read.csv("Data/MIL_DailyRainfall.csv")
rain$Date <- as.Date(rain$Date)

# Display the start or end of a dataframe with head() and tail()
head(rain)
##         Date Rainfall
## 1 2017-01-01      0.0
## 2 2017-01-02      1.0
## 3 2017-01-03     36.5
## 4 2017-01-04      3.5
## 5 2017-01-05      0.0
## 6 2017-01-06      0.0

Subsetting dataframes

Dataframes can be subset using [row, column]

rain[1, 1]
## [1] "2017-01-01"
rain[1,]
##         Date Rainfall
## 1 2017-01-01        0
head(rain[,1])
## [1] "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" "2017-01-05"
## [6] "2017-01-06"

Building a PCAL dataframe

# Create a month column for calculating statistics
rain$Month <- factor(format(rain$Date, "%b"), levels = month.abb)

head(rain)
##         Date Rainfall Month
## 1 2017-01-01      0.0   Jan
## 2 2017-01-02      1.0   Jan
## 3 2017-01-03     36.5   Jan
## 4 2017-01-04      3.5   Jan
## 5 2017-01-05      0.0   Jan
## 6 2017-01-06      0.0   Jan

Note that you can call a column by name with $

Building a PCAL dataframe

# Calculate monthly total rainfall
rainTotal <- tapply(rain$Rainfall, rain$Month, sum)

# Create a dataframe with month and total rainfall
df <- data.frame(Month = month.abb,
                 Total = rainTotal)

head(df)
##     Month Total
## Jan   Jan  88.5
## Feb   Feb  77.0
## Mar   Mar  96.0
## Apr   Apr 192.5
## May   May 104.0
## Jun   Jun  32.5

Building a PCAL dataframe

# Add min, mean and max to the dataframe
pcal <- cbind(df, 
              Min = tapply(rain$Rainfall, rain$Month, min), 
              Mean = tapply(rain$Rainfall, rain$Month, mean), 
              Max = tapply(rain$Rainfall, rain$Month, max))

# Drop the rownames
row.names(pcal) <- 1:12


head(pcal)
##   Month Total Min     Mean  Max
## 1   Jan  88.5   0 2.854839 36.5
## 2   Feb  77.0   0 2.750000 29.5
## 3   Mar  96.0   0 3.096774 18.9
## 4   Apr 192.5   0 6.416667 45.5
## 5   May 104.0   0 3.354839 21.5
## 6   Jun  32.5   0 1.083333 13.0

Building a PCAL dataframe

# Add min, mean and max to the dataframe
pcal <- cbind(df, 
              Min = tapply(rain$Rainfall, rain$Month, min), 
              Mean = round(tapply(rain$Rainfall, rain$Month, mean), 2), 
              Max = tapply(rain$Rainfall, rain$Month, max))

# Drop the rownames
row.names(pcal) <- 1:12


head(pcal)
##   Month Total Min Mean  Max
## 1   Jan  88.5   0 2.85 36.5
## 2   Feb  77.0   0 2.75 29.5
## 3   Mar  96.0   0 3.10 18.9
## 4   Apr 192.5   0 6.42 45.5
## 5   May 104.0   0 3.35 21.5
## 6   Jun  32.5   0 1.08 13.0

Making plots

Base R plotting

R has built in plotting capabilities - to view more details about possible graphical parameters try ?plot

x <- data.frame(Month = 1:12, 
                Flow = c(25.4, 38.6, 27.3, 56.5, 53.4, 48.7, 
                         59.1, 62.1, 60.1, 69.8, 25.3, 13.5))

A basic plot

plot(x)

Improving the plot

plot(x, main = "Manawatu at Teachers College", pch = 16, col = 1:12, type = "b",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")

Adding more information to our plot

# Choosing colours for the points based on flow values
cols <- vector(length = 12)
cols[x$Flow < 30] <- "red"
cols[x$Flow > 60] <- "blue"
cols[x$Flow >= 30 & x$Flow <= 60] <- "black"

—-_

plot(x, main = "Manawatu at Teachers College Monthly Flow", xaxt = "n",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")
rect(6, 0, 9, 72, col = "lightgrey")
text(7.5, 45, "Winter")
lines(x$Month, x$Flow)
points(x, pch = 16, col = cols)
abline(h = 30, lty = 2)
abline(h = 60, lty = 2)
text(c(0, 20), "Low flows")
text(c(0, 65), "High flows")
axis(1, x$Month, month.abb)

Add a legend

par(oma=c(0, 0, 0, 5))

plot(x, main = "Manawatu at Teachers College Monthly Flow", xaxt = "n",
     xlab = "2017", ylab = "Avg Monthly Flow (cumecs)")
rect(6, 0, 9, 72, col = "lightgrey")
text(7.5, 45, "Winter")
lines(x$Month, x$Flow)
points(x, pch = 16, col = cols)
abline(h = 30, lty = 2)
abline(h = 60, lty = 2)
text(c(0, 20), "Low flows")
text(c(0, 65), "High flows")
axis(1, x$Month, month.abb)

legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA, pch = 16, col = unique(cols),
       legend = c("Low flow", "Moderate flow", "High flow"))

Other libraries for plotting

There are a multitude of different packages for plotting. Some of the most popular include:

  • ggplot2

    ggplot2 is part of the tidyverse package collection. Very basically it works by mapping the aesthetics of a dataset and then adding layers, scales, facets, etc… to build up a plot. If you would like to learn more about tidyverse coding I highly recommend R for data science by Hadley Wickham.

library(ggplot2)

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy, color = class))

Other libraries for plotting

There are a multitude of different packages for plotting. Some of the most popular include:

  • ggplot2
  • htmlwidgets

    htmlwidgets are a collection of packages which provide an R interface to JavaScript libraries to create interactive visualisations. There are over 50 widgets available including dedicated options for mapping and timeseries. To see what widgets exist visit htmlwidgets for R

library(plotly)
plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, color = ~Species, type = "scatter")

library(dygraphs)
dygraph(nhtemp, main = "New Haven Temperatures") %>% 
  dyRangeSelector(dateWindow = c("1920-01-01", "1960-01-01"))

library(leaflet)
leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R")

Functions in R

Built in functions

Functions in R are denoted as: function name( )

Function parameters or arguments are separated by a comma and may be named.

We have used a number of functions already such as;

x <- c(1, 2, 3)

length(x)

plot(x, main = "A Plot", cols = x)

Tip: To view a function’s arguments try typing in function name() and hit tab while the cursor is inside the parentheses.

Vectorization

Vectorization is a fundamental principle in R. It refers to the ability to perform operations on a vector as a whole rather than element by element.

For example we can perform vector arithmetic;

x <- c(1, 2, 3)
y <- c(10, 20, 30)
x+y
## [1] 11 22 33

Or we can pass a vector to a function and receive a result for each element;

sqrt(x)
## [1] 1.000000 1.414214 1.732051

Recycling

The recycling rule is how R handles vector operations on vectors of different length. When the operation requires that both vectors be the same length R will automatically recycle the values of the shorter vector until both vectors are the same length.

For example;

x + 10
## [1] 11 12 13
x[c(TRUE, FALSE)]
## [1] 1 3

A Calculator Curiosity

This example is taken from the book Professor Stewart’s Hoard of Mathematical Treasures by Ian Stewart. The solution code was provided by Prof. Ross Ihaka.

(8 x 8) + 13

(8 x 88) + 13

(8 x 888) + 13

(8 x 8888) + 13

(8 x 88888) + 13

(8 x 888888) + 13

(8 x 8888888) + 13

(8 x 88888888) + 13

A Calculator Curiosity: A vectorised solution

The sequences of 8 can be decomposed into 8, 8 + 80, 8 + 80 + 800, ect…

Those values are the cummulative sum of the sequence 8, 80, 800, etc…

That sequence is equivalent to 8x10^0, 8x10^1, 8x10^2, etc…

So our code is as follows;

n = 8
cumsum(8 * 10^(1:n - 1))
## [1]        8       88      888     8888    88888   888888  8888888 88888888
8 * cumsum(8 * 10^(1:n - 1)) + 13
## [1]        77       717      7117     71117    711117   7111117  71111117
## [8] 711111117

Control flow

There are several control constructs available in R, most commonly;

if(condition){
  expression
}
if(condition){
  expression_1
}else{ 
  expression_2
}
for(variable in vector){
  expression
}

A Calculator Curiosity: A control flow solution

How can we solve the calculator curiosity problem using control flow?

The most logical method for solving our problem is using a for loop.

Again building up the 8, 88, 888… is key.

We can do this by multiplying the previous value by 10 and adding 8.

A Calculator Curiosity: A control flow solution

numSeq = 8

for (i in 1:8) {
  print(8 * numSeq + 13)
  
  numSeq <- 10 * numSeq + 8
}
## [1] 77
## [1] 717
## [1] 7117
## [1] 71117
## [1] 711117
## [1] 7111117
## [1] 71111117
## [1] 711111117

Writing your own functions

At some point you may want to start writing your own functions.

This can easily be acomplished in R using the structure

functionName <- function(arg_1, arg_2 = defaultValue){
  
                         Expressions to do something

                         return(results)
                }

You can then call your function with the default value

functionName(x)

or if you would like to overwrite the default value

functionName(x, arg_2 = y)

A Calculator Curiosity: A function

Now that we can calculate the problem, how can we make a function to output any number of answers?

Our vectorised version is already a simple expression using 1 input:

curiosity_1 <- function(n) 8 * cumsum(8 * 10^(1:n - 1)) + 13
curiosity_1(5)
## [1]     77    717   7117  71117 711117

A Calculator Curiosity: A function

Our control flow solution is a bit more complicated. We need some way to store the results instead of just printing them out.

curiosity_2 <- function(n){
  ans <- numeric(n)
  numSeq = 8

  for (i in 1:n) {
    ans[i] <- 8 * numSeq + 13
    
    numSeq <- 10 * numSeq + 8
  }
  return(ans)
}
curiosity_2(3)
## [1]   77  717 7117

Timeseries and Hilltop

Hilltop and R

There are a number of different ways to get Hilltop data into R, most of them rather laborious…

  • Export your data in Hilltop to csv and then read it into R.

  • Query your Hilltop server and decode the XML in R.

  • Use Mark Rodgers’ R package which connects Hilltop to R so you can read the data straight in.

Installing the Hilltop package

In order for the Hilltop package to work you must have Hydrolib.dll registered AND be using the 32-bit R version.

You can change your default R version in RStudio by going to Tools > Global Options

To install the package in RStudio:

  1. Tools > Install Packages…

  2. Select Package Archive File (.zip; .tar.gz) from the dropdown menu

  3. Navigate to your Hilltop installation (e.g. C:/hilltop) and select hilltop.zip

If hydrolib is not registered or you are using 64-bit R you will see the following

Error: package or namespace load failed for ‘Hilltop’: .onLoad failed in loadNamespace() for ‘Hilltop’

Using the Hilltop package

The Hilltop package is relatively straightforward. All of the functions are based on the idea that you

Connect to a Hilltop file using HilltopData()

library(Hilltop)
public <- "//ares/Telemetry/PublicTelemetry.hts"
myHTS <- HilltopData(public)

Find your site

head(SiteList(obj = myHTS, measurement = "Flow [Water Level]"))
##                                Site Easting Northing           StartTime
## 1 Forest Rd Drain at Drop Structure 2701900  6103500 2016-01-01 00:05:00
## 2             Hautapu at Alabasters 2748600  6168300 2016-04-15 00:00:00
## 3          Hokio at Lake Horowhenua 2699300  6064300 2016-01-01 00:05:00
## 4      Kahuterawa at Johnstons Rata 2732333  6080805 2016-02-05 00:00:00
## 5           Kai Iwi at Handley Road 2672600  6145500 2016-03-12 00:00:00
## 6            Kiwitea at Haynes Line 2736600  6120700 2016-02-05 00:00:00
##               EndTime
## 1 2018-10-30 17:40:00
## 2 2018-10-30 17:30:00
## 3 2018-10-30 17:50:00
## 4 2018-10-30 17:30:00
## 5 2018-10-30 17:30:00
## 6 2017-08-01 18:20:00

Pick a measurement

head(MeasurementList(obj = myHTS, SiteName = "Manawatu at Teachers College"))
##                                  Measurement            StartTime
## 1                        Stage [Water Level]  1-Jan-2016 00:05:00
## 2                         Flow [Water Level] 10-Feb-2016 00:00:00
## 3                Mean Velocity [Water Level]  1-Jan-2016 00:05:00
## 4               Return Periods [Water Level]  1-Jan-2016 00:05:00
## 5 Flow Distribution Percentile [Water Level]  1-Jan-2016 00:05:00
## 6           Flow Mean (1 Hour) [Water Level] 10-Feb-2016 00:00:00
##                EndTime       Units InterpolationMethod         DataType
## 1 30-Oct-2018 17:30:00          mm             Instant SimpleTimeSeries
## 2 30-Oct-2018 17:30:00         l/s             Instant SimpleTimeSeries
## 3 30-Oct-2018 17:30:00         m/s             Instant SimpleTimeSeries
## 4 30-Oct-2018 17:30:00 ARI (Years)             Instant SimpleTimeSeries
## 5 30-Oct-2018 17:30:00           %             Instant SimpleTimeSeries
## 6 30-Oct-2018 17:00:00         l/s             Instant SimpleTimeSeries
##      TSType
## 1 StdSeries
## 2 StdSeries
## 3 StdSeries
## 4 StdSeries
## 5 StdSeries
## 6 StdSeries

Get some metadata

head(SiteInfo(obj = myHTS, SiteName = "Manawatu at Teachers College"))
##           Syn1           Syn2     Short Name       Altitude Catchment Area 
##             ""          "TCO"     "Hor-0034"       "21.209"      "3900.00" 
##        Easting 
##      "2733100"

If you want more metadata than what is provided with SiteInfo you need to connect directly to the Hilltop SQL database using the RODBC package.

library(RODBC)

ch <- odbcDriverConnect('driver={SQL Server};server=hilltopdb;database=Hilltop;
                        trusted_connection=true')
  sqlData <- sqlQuery(ch, paste("SELECT [Table Column Name],[Table Column Name] FROM", 
                                [Table Name]))
close(ch)

Pull data

flow <- GetData(obj = myHTS, siteName = "Manawatu at Teachers College", 
                measurement = "Flow [Water Level]", startTime = "1-10-2018", 
                endTime = "", interval = "1 day", gapTolerance = "3 hours")

Hilltop reads data into R as a zoo object - a time series object. Zoos work by having a unique time index which can then be paired with any number of values.

str(flow)
## 'zoo' series from 2018-10-02 to 2018-10-30
##   Data: num [1:29] 41991 48377 56507 42984 37470 ...
##  - attr(*, "SiteName")= chr "Manawatu at Teachers College"
##  - attr(*, "Measurement")= chr "Flow"
##  - attr(*, "Units")= chr "l/s"
##  - attr(*, "InterpolationMethod")= chr "Instant"
##  - attr(*, "DataType")= chr "SimpleTimeSeries"
##  - attr(*, "TSType")= chr "StdSeries"
##   Index:  POSIXct[1:29], format: "2018-10-02" "2018-10-03" "2018-10-04" "2018-10-05" "2018-10-06" ...

Because this is a native time series object the package forces some handy time series rules to apply. For example, if you want to combine time series then the index (time) must be unique and it will automatically arrange the series in increasing time order.

stage <- GetData(obj = myHTS, siteName = "Manawatu at Teachers College", 
                measurement = "Stage [Water Level]", startTime = "1-10-2018", 
                endTime = "", interval = "1 day", gapTolerance = "3 hours")

head(merge(flow, stage))
##                         flow stage
## 2018-10-02 13:00:00 41991.08   626
## 2018-10-03 13:00:00 48376.84   683
## 2018-10-04 13:00:00 56507.14   752
## 2018-10-05 13:00:00 42984.00   635
## 2018-10-06 13:00:00 37470.29   584
## 2018-10-07 13:00:00 34457.27   555

You can also pass it to plot functions which recognize timeseries

dygraph(flow)

Disconnect from Hilltop

disconnect(myHTS)

Time in R

Whenever you are playing with timeseries be EXTRA careful.

R does not handle time very well. Some functions, particularly ‘bind’ or merge operations, will convert timeseries back into the local timezone so always compare your results to Hilltop!

Time operations also tend to be finicky. I highly recommend the package lubridate when working with time!

In general you will use 2 time flavours - as.Date() for date information and as.POSIXct() for date-time information.

Try the following;

Sys.Date()

Sys.time()

Reporting with rmarkdown

R Markdown

R Markdown is R’s flavour of markdown which is a light weight markup language to allow plan text to be formatted on conversion to HTML or other formats.

Wikipedia’s example here demonstrates how markdown syntax compares to HTML formatting and how it looks when rendered.

Rmarkdown allows for the creation of fully reproducible documents which can save, execute, and render R code. Rmarkdown files have the .Rmd extension.

The key components of rmarkdown are;

  • A YAML header surrounded by - - - at the top of the page - this is where you choose your output.

  • R code chunks surrounded by ``` - code inside of these chunks will execute as normal.

  • Text with markdown formatting.